STOMP: confirm utf-8 handling (backport #13858) #13860
Merged
This is an intermediate conclusion/confirmation that our STOMP implementation can handle multi-byte characters in UTF-8 encoding.
The question came up during Native STOMP review.
The frame parser collects bytes one by one into the `acc` list and reverses that list before transitioning to the next state, so a multi-byte character is represented there by the corresponding number of integers, each no greater than 255. In tests and in our code we work with headers via Erlang string literals, which (at least with the default source file encoding) accept Unicode just fine and use UTF-8 as the encoding. The tricky part is that string literals are encoded as lists of integers (code points), not as lists of bytes: `"headꙕr1"` becomes `[104,101,97,100,42581,114,49]`.
Binary literals without encoding:
and with:
This last list is exactly the list we get in the frame parser.
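The difference between the two lists can be sketched in Python (the language of the newly added test). This is only an illustration under stated assumptions, not the Erlang parser itself: `collect_bytes` is a hypothetical stand-in for the prepend-then-reverse accumulator described above, and the sample header is the one from this description.

```python
def collect_bytes(data):
    """Hypothetical stand-in for the parser accumulator:
    prepend each incoming byte, then reverse once at the end."""
    acc = []
    for b in data:
        acc.insert(0, b)  # mimics prepending to an Erlang list, [B | Acc]
    acc.reverse()
    return acc

header = "head\ua655r1"  # "headꙕr1"; U+A655 is decimal 42581

# What a string literal holds: one integer per code point.
code_points = [ord(ch) for ch in header]

# What arrives on the wire: one integer per UTF-8 byte.
wire_bytes = collect_bytes(header.encode("utf-8"))

print(code_points)  # [104, 101, 97, 100, 42581, 114, 49]
print(wire_bytes)   # [104, 101, 97, 100, 234, 153, 149, 114, 49]
```

The single code point 42581 turns into three bytes (234, 153, 149) on the wire, which is why comparing a parsed header against a plain string literal looks like a mismatch even though the data was relayed correctly.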
It was confusing at first, until I realized I was mostly fighting Erlang in the tests. The newly added Python test simply confirms that UTF-8 data is relayed just fine. As for standard headers and our 'x-' extensions, they all fit into ASCII, so there is no problem when we look them up using `stomp_frame:header`.
Bottom line: we relay UTF-8 just fine, and as long as we keep the default encoding for our source files, the string literals in our code keep working.
PS. Curiously, Erlang's `list_to_binary` doesn't work with Unicode strings, that is, lists containing code points above 255 (the `unicode` module must be used):
I don't know yet whether this matters for us outside STOMP, but for Unicode text `list_to_binary` should be replaced with `unicode:characters_to_binary`:
However, all our protocol strings fit into the first 128 ASCII codes, so we are just fine.
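A rough Python analogue of that point, offered as a sketch rather than a reproduction of the Erlang calls: `to_binary_strict` and `to_binary_utf8` are hypothetical helper names standing in for a byte-only conversion and a Unicode-aware one, respectively.

```python
def to_binary_strict(ints):
    """Byte-only conversion: every element must be in 0..255."""
    return bytes(ints)

def to_binary_utf8(code_points):
    """Unicode-aware conversion: treat elements as code points, encode as UTF-8."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

unicode_header = [104, 101, 97, 100, 42581, 114, 49]   # "headꙕr1"
ascii_header = [ord(ch) for ch in "content-length"]    # pure ASCII

try:
    to_binary_strict(unicode_header)
    strict_ok = True
except ValueError:
    # 42581 does not fit into a single byte
    strict_ok = False

print(strict_ok)                         # False
print(to_binary_utf8(unicode_header))    # b'head\xea\x99\x95r1'
# For ASCII-only names both conversions agree.
print(to_binary_strict(ascii_header) == to_binary_utf8(ascii_header))  # True
```

For ASCII-only protocol strings the two conversions produce identical bytes, which matches the observation that the existing header look-ups are unaffected.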
This is an automatic backport of pull request #13858 done by [Mergify](https://mergify.com).